Word–Wise Script Identification from Indian Documents

Internal identifier: 001458 (Main/Exploration); previous: 001457; next: 001459

Authors: Suranjit Sinha [India]; Umapada Pal [India]; Bidyut Baran Chaudhuri [India]

Source :

Lecture Notes in Computer Science [ 0302-9743 ] ; 2004.

RBID : ISTEX:DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E

French descriptors

Pascal (Inist)
- Analyse donnée, Communication écrite, Reconnaissance caractère, Reconnaissance optique caractère, Segment droite, Structure document, Texte, Topologie.

English descriptors

KwdEn :
- Character recognition, Data analysis, Document structure, Line segment, Optical character recognition, Text, Topology, Written communication.

Abstract

Abstract: In a country like India, a single text line in most official documents contains words in two different scripts. Under the two-language formula, Indian documents are written in English and the official language of the state. For Optical Character Recognition (OCR) of such a document page, words in different scripts must be separated before they are fed to the OCR system of each script. This paper proposes a robust technique for word-wise script identification in Indian doublet-form documents. The document is first segmented into lines, and the lines are then segmented into words. Individual script words are identified using topological and structural features (such as number of loops, headline feature, water-reservoir-concept-based features, and profile features). The proposed scheme was tested on 24210 words from different doublets and achieved more than 97% accuracy on average.
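
The line-then-word segmentation step mentioned in the abstract could be sketched as follows. This is only an illustration and not the authors' implementation: the abstract does not say how segmentation is done, so the use of horizontal and vertical projection profiles is an assumption, as are the helper names `runs` and `segment_words` and the `word_gap` threshold. The script-identification features (loops, headline, water reservoir, profiles) are not covered here.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink pixel, 0 = background


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (end exclusive) of non-zero runs in a profile.

    Runs separated by fewer than `min_gap` zero entries are merged, so small
    gaps between characters do not split a word. (Hypothetical helper.)
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def segment_words(img: Image, word_gap: int = 2) -> List[Tuple[int, int, int, int]]:
    """Split a binary page into text lines, then each line into words.

    Returns (top, bottom, left, right) boxes with exclusive bottom/right.
    `word_gap` is an assumed threshold: the minimum number of blank columns
    treated as a word boundary.
    """
    boxes = []
    row_profile = [sum(row) for row in img]          # ink count per row
    for top, bottom in runs(row_profile):            # one run = one text line
        line = img[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]  # ink count per column
        for left, right in runs(col_profile, min_gap=word_gap):
            boxes.append((top, bottom, left, right))
    return boxes


# Toy page: two text lines separated by one blank row.
page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
]
words = segment_words(page, word_gap=2)
```

In the toy page, the first line splits into two words because two blank columns separate them, while the one-column gap in the second line is treated as spacing inside a single word. Each resulting box would then be passed to a script classifier.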

Url:

https://api.istex.fr/document/DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E/fulltext/pdf

DOI: 10.1007/978-3-540-28640-0_29

Affiliations:

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Word–Wise Script Identification from Indian Documents</title>
<author><name sortKey="Sinha, Suranjit" sort="Sinha, Suranjit" uniqKey="Sinha S" first="Suranjit" last="Sinha">Suranjit Sinha</name>
</author>
<author><name sortKey="Pal, Umapada" sort="Pal, Umapada" uniqKey="Pal U" first="Umapada" last="Pal">Umapada Pal</name>
</author>
<author><name sortKey="Chaudhuri, B" sort="Chaudhuri, B" uniqKey="Chaudhuri B" first="B." last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<affiliation><country>Inde</country>
<placeName><settlement type="city">Calcutta</settlement>
<region type="province">Bengale-Occidental</region>
</placeName>
<orgName type="lab" n="5">Institut indien de statistiques</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-28640-0_29</idno>
<idno type="url">https://api.istex.fr/document/DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000498</idno>
<idno type="wicri:Area/Istex/Curation">000491</idno>
<idno type="wicri:Area/Istex/Checkpoint">000C93</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Sinha S:word:wise:script</idno>
<idno type="wicri:Area/Main/Merge">001509</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:04-0533877</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000521</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000268</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000446</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Sinha S:word:wise:script</idno>
<idno type="wicri:Area/Main/Merge">001657</idno>
<idno type="wicri:Area/Main/Curation">001458</idno>
<idno type="wicri:Area/Main/Exploration">001458</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Word–Wise Script Identification from Indian Documents</title>
<author><name sortKey="Sinha, Suranjit" sort="Sinha, Suranjit" uniqKey="Sinha S" first="Suranjit" last="Sinha">Suranjit Sinha</name>
<affiliation wicri:level="1"><country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
<wicri:noRegion>Kolkata</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Pal, Umapada" sort="Pal, Umapada" uniqKey="Pal U" first="Umapada" last="Pal">Umapada Pal</name>
<affiliation wicri:level="1"><country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
<wicri:noRegion>Kolkata</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Inde</country>
</affiliation>
</author>
<author><name sortKey="Chaudhuri, B" sort="Chaudhuri, B" uniqKey="Chaudhuri B" first="B." last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<affiliation wicri:level="1"><country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
<wicri:noRegion>Kolkata</wicri:noRegion>
<placeName><settlement type="city">Calcutta</settlement>
<region type="province">Bengale-Occidental</region>
</placeName>
<orgName type="lab" n="5">Institut indien de statistiques</orgName>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2004</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E</idno>
<idno type="DOI">10.1007/978-3-540-28640-0_29</idno>
<idno type="ChapterID">29</idno>
<idno type="ChapterID">Chap29</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Data analysis</term>
<term>Document structure</term>
<term>Line segment</term>
<term>Optical character recognition</term>
<term>Text</term>
<term>Topology</term>
<term>Written communication</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Analyse donnée</term>
<term>Communication écrite</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Segment droite</term>
<term>Structure document</term>
<term>Texte</term>
<term>Topologie</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: In a country like India, a single text line in most official documents contains words in two different scripts. Under the two-language formula, Indian documents are written in English and the official language of the state. For Optical Character Recognition (OCR) of such a document page, words in different scripts must be separated before they are fed to the OCR system of each script. This paper proposes a robust technique for word-wise script identification in Indian doublet-form documents. The document is first segmented into lines, and the lines are then segmented into words. Individual script words are identified using topological and structural features (such as number of loops, headline feature, water-reservoir-concept-based features, and profile features). The proposed scheme was tested on 24210 words from different doublets and achieved more than 97% accuracy on average.</div>
</front>
</TEI>
<affiliations><list><country><li>Inde</li>
</country>
<region><li>Bengale-Occidental</li>
</region>
<settlement><li>Calcutta</li>
</settlement>
<orgName><li>Institut indien de statistiques</li>
</orgName>
</list>
<tree><country name="Inde"><noRegion><name sortKey="Sinha, Suranjit" sort="Sinha, Suranjit" uniqKey="Sinha S" first="Suranjit" last="Sinha">Suranjit Sinha</name>
</noRegion>
<name sortKey="Chaudhuri, B" sort="Chaudhuri, B" uniqKey="Chaudhuri B" first="B." last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<name sortKey="Pal, Umapada" sort="Pal, Umapada" uniqKey="Pal U" first="Umapada" last="Pal">Umapada Pal</name>
<name sortKey="Pal, Umapada" sort="Pal, Umapada" uniqKey="Pal U" first="Umapada" last="Pal">Umapada Pal</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001458 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001458 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E
   |texte=   Word–Wise Script Identification from Indian Documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
